In this work we aim to solve a large collection of tasks using a single reinforcement learning agent with a single set of parameters. A key challenge is to handle the increased amount of data and extended training time. We have developed a new distributed agent IMPALA (Importance Weighted Actor-Learner Architecture) that not only uses resources more efficiently in singlemachine training but also scales to thousands of machines without sacrificing data efficiency or resource utilisation. We achieve stable learning at high throughput by combining decoupled acting and learning with a novel off-policy correction method called V-trace. We demonstrate the effectiveness of IMPALA for multi-task reinforcement learning on DMLab-30 (a set of 30 tasks from the DeepMind Lab environment (Beattie et al., 2016)) and Atari-57 (all available Atari games in Arcade Learning Environment (Bellemare et al., 2013a)). Our results show that IMPALA is able to achieve better performance than previous agents with less data, and crucially exhibits positive transfer between tasks as a result of its multi-task approach. The source code is publicly available at github.com/deepmind/scalable agent.
translated by 谷歌翻译
Generating realistic motions for digital humans is a core but challenging part of computer animations and games, as human motions are both diverse in content and rich in styles. While the latest deep learning approaches have made significant advancements in this domain, they mostly consider motion synthesis and style manipulation as two separate problems. This is mainly due to the challenge of learning both motion contents that account for the inter-class behaviour and styles that account for the intra-class behaviour effectively in a common representation. To tackle this challenge, we propose a denoising diffusion probabilistic model solution for styled motion synthesis. As diffusion models have a high capacity brought by the injection of stochasticity, we can represent both inter-class motion content and intra-class style behaviour in the same latent. This results in an integrated, end-to-end trained pipeline that facilitates the generation of optimal motion and exploration of content-style coupled latent space. To achieve high-quality results, we design a multi-task architecture of diffusion model that strategically generates aspects of human motions for local guidance. We also design adversarial and physical regulations for global guidance. We demonstrate superior performance with quantitative and qualitative results and validate the effectiveness of our multi-task architecture.
translated by 谷歌翻译
A household robot should be able to navigate to target locations without requiring users to first annotate everything in their home. Current approaches to this object navigation challenge do not test on real robots and rely on expensive semantically labeled 3D meshes. In this work, our aim is an agent that builds self-supervised models of the world via exploration, the same as a child might. We propose an end-to-end self-supervised embodied agent that leverages exploration to train a semantic segmentation model of 3D objects, and uses those representations to learn an object navigation policy purely from self-labeled 3D meshes. The key insight is that embodied agents can leverage location consistency as a supervision signal - collecting images from different views/angles and applying contrastive learning to fine-tune a semantic segmentation model. In our experiments, we observe that our framework performs better than other self-supervised baselines and competitively with supervised baselines, in both simulation and when deployed in real houses.
translated by 谷歌翻译
We propose a novel antialiasing method to increase shift invariance in convolutional neural networks (CNNs). More precisely, we replace the conventional combination "real-valued convolutions + max pooling" ($\mathbb R$Max) by "complex-valued convolutions + modulus" ($\mathbb C$Mod), which produce stable feature representations for band-pass filters with well-defined orientations. In a recent work, we proved that, for such filters, the two operators yield similar outputs. Therefore, $\mathbb C$Mod can be viewed as a stable alternative to $\mathbb R$Max. To separate band-pass filters from other freely-trained kernels, in this paper, we designed a "twin" architecture based on the dual-tree complex wavelet packet transform, which generates similar outputs as standard CNNs with fewer trainable parameters. In addition to improving stability to small shifts, our experiments on AlexNet and ResNet showed increased prediction accuracy on natural image datasets such as ImageNet and CIFAR10. Furthermore, our approach outperformed recent antialiasing methods based on low-pass filtering by preserving high-frequency information, while reducing memory usage.
translated by 谷歌翻译
Planet formation is a multi-scale process in which the coagulation of $\mathrm{\mu m}$-sized dust grains in protoplanetary disks is strongly influenced by the hydrodynamic processes on scales of astronomical units ($\approx 1.5\times 10^8 \,\mathrm{km}$). Studies are therefore dependent on subgrid models to emulate the micro physics of dust coagulation on top of a large scale hydrodynamic simulation. Numerical simulations which include the relevant physical effects are complex and computationally expensive. Here, we present a fast and accurate learned effective model for dust coagulation, trained on data from high resolution numerical coagulation simulations. Our model captures details of the dust coagulation process that were so far not tractable with other dust coagulation prescriptions with similar computational efficiency.
translated by 谷歌翻译
This contribution presents a deep learning method for the extraction and fusion of information relating to kidney stone fragments acquired from different viewpoints of the endoscope. Surface and section fragment images are jointly used during the training of the classifier to improve the discrimination power of the features by adding attention layers at the end of each convolutional block. This approach is specifically designed to mimic the morpho-constitutional analysis performed in ex-vivo by biologists to visually identify kidney stones by inspecting both views. The addition of attention mechanisms to the backbone improved the results of single view extraction backbones by 4% on average. Moreover, in comparison to the state-of-the-art, the fusion of the deep features improved the overall results up to 11% in terms of kidney stone classification accuracy.
translated by 谷歌翻译
个性化移动代理中的感知系统需要开发室内场景理解模型,这些模型可以理解3D几何,捕获客观性,分析人类行为等。但是,与户外环境的模型相比,该方向并未得到充分探索(例如自动驾驶系统,包括行人预测,汽车检测,交通标志识别等)。在本文中,我们首先讨论主要挑战:不足,甚至没有标记为现实世界室内环境的数据,以及其他挑战,例如异质信息来源(例如RGB图像和LIDAR点云)之间的融合,建模关系建模关系在各种输出集(例如3D对象位置,深度估计和人类姿势)和计算效率之间。然后,我们描述MMISM(多模式输入多任务输出室内场景理解模型)来应对上述挑战。 MMISM认为RGB图像以及稀疏的LIDAR点是输入和3D对象检测,深度完成,人体姿势估计和语义分割作为输出任务。我们表明,MMISM在PAR上执行甚至比单任务模型更好。例如,我们在基准Arkitscenes数据集上将基线3D对象检测结果提高了11.7%。
translated by 谷歌翻译
新颖的对象字幕(NOC)旨在描述包含对象的图像,而无需在训练过程中观察其地面真相标题。由于缺乏字幕注释,无法通过序列到序列训练或苹果酒优化直接优化字幕模型。结果,我们提出了启用释义(P2C),这是一个针对NOC的两阶段学习框架,它将通过释义通过释义来优化输出字幕。使用P2C,字幕模型首先从仅在文本语料库中预先训练的语言模型中学习释义,从而扩展了Bank一词以提高语言流利度。为了进一步实施足够描述输入图像的视觉内容的输出字幕,我们对引入的忠诚度和充分性目标进行字幕模型执行自我贴形。由于在训练过程中没有任何地面真相标题可用于新颖的对象图像,因此我们的P2C利用交叉模式(图像文本)关联模块可以确保可以正确保留上述字幕特征。在实验中,我们不仅表明我们的P2C在NOCAPS和COCO字幕数据集上实现了最先进的性能,而且还通过替换NOC的语言和跨模式关联模型来验证学习框架的有效性和灵活性。实施详细信息和代码可在补充材料中找到。
translated by 谷歌翻译
避免碰撞是移动机器人和代理在现实世界中安全运作的关键。在这项工作中,我们提出了一个有效而有效的避免碰撞系统,该系统结合了现实世界增强学习(RL),基于搜索的在线轨迹计划和自动紧急干预,例如自动紧急制动(AEB)。RL的目的是学习有效的搜索启发式方法,以加快寻找无碰撞轨迹的搜索并减少触发自动紧急干预措施的频率。这种新颖的设置使RL能够在现实世界中的室内环境中安全,直接在移动机器人上学习,从而最大程度地减少培训的实际崩溃。我们的现实世界实验表明,与多个基线相比,我们的方法具有更高的平均速度,较低的崩溃率,更高的目标达到速率,较小的计算开销以及整体控制更平滑。
translated by 谷歌翻译
早期预测在临床上被认为是脑瘫(CP)治疗的重要部分之一。我们建议实施一个基于一般运动评估(GMA)的CP预测的低成本和可解释的分类系统。我们设计了一个基于Pytorch的注意力图形卷积网络,以识别从RGB视频中提取的骨骼数据中有CP风险的早期婴儿。我们还设计了一个频率模块,用于在过滤噪声时学习频域中的CP运动。我们的系统仅需要消费级RGB视频进行培训,以通过提供可解释的CP分类结果来支持交互式时间CP预测。
translated by 谷歌翻译